Recent development and perspectives of machines for lattice QCD
Department of Physics, University of Wuppertal, 42097 Wuppertal, Germany

I am going to highlight recent progress in cluster computer technology and to assess status and prospects of 
cluster computers for lattice QCD with respect to the development of QCDOC and apeNEXT. Taking the LatFor 
test case, I specify a 512-processor QCD-cluster better than 1 $/Mflops.

1. INTRODUCTION

Driven by the ever increasing demand of lattice 
QCD for compute power, Computer Science has 
become a serious activity of its own for many re- 
search groups, with a proven record of success. A 
variety of "home made" QCD engines is described 
in a long-standing series of "machine talks" from 
early Lattice Conferences on. 

In the US, lattice physicists have always been close to computer companies, e.g. IBM, where the first-generation GF11 project started in 1983 [1,2,3], or TMC [4,5]. In Japan, cooperation of physics and computer science began in 1978, continuing with the QCDPAX series [6,7]. In 1996, CP-PACS, a long term leader of the TOP500 list, was built by CCP scientists together with computer industries [8,9]. Certainly, this symbiosis has to a good deal pushed Japan to the top position in HPC [10].

Soon fourth generation "home-made" systems will become operational. In the US, computer science activities at Columbia go back to 1982 [11,12,13]. The group devised QCDSP in 1999, a third generation QCD-computer [14], and is about to finish the prototype of QCDOC, a highly scalable multi-Tflops system [15,16], in collaboration with UKQCD and IBM. In Europe, INFN/Rome started with the first generation APE in 1984 [17]. In 1993, the INFN/Pisa-Rome group presented the second generation APE100 [18,19], followed by the third generation APEmille in 2001 [20]. First CPUs for apeNEXT, designed by the Berlin-Pisa-Rome group (DESY-INFN) for a speed of several Tflops, are expected for autumn 2003 [21].



A new HPC variety has entered the stage more recently [22,23]. Built from standard PC components, cluster computers can be readily adapted to lattice QCD [21]. They strive to win the QCD-computer contest for the lowest price/performance ratio, claimed by both QCDOC and apeNEXT with a sustained performance of 1 $/Mflops (Mflop/s) for double precision Wilson fermion computations in 2004.

I have been asked to highlight recent progress in cluster computer technology and to assess the opportunities of QCD-clusters and home-made systems to win the contest. To this end I choose the LatFor test case [21] and consider two cost functions, the price/performance ratio R for investment costs and the waste heat H for cost of operation. Based on performance results given in section 5, I will specify a 512-processor QCD-cluster with R = 1 $/Mflops and H ≈ 0.12 W/Mflops.

In sections 2 and 3 I discuss general and QCD-optimized clusters. Recent PC hardware developments are presented in section 4. The status of QCDOC and apeNEXT is given in section 6.

2. RISE OF CLUSTER COMPUTING 

Table 1 illustrates the increasing presence of cluster computers in the TOP500 list, which is sorted according to Linpack benchmark results of the most powerful computing systems worldwide. The TOP500 group defines a cluster as a parallel computer where the number of nodes is larger than the number of processors per node. If the number of nodes is less than the number of processors per node, the system is termed a constellation. The main fraction in the TOP500 list consists of single node MPPs while non-clustered SMPs have nearly vanished (2%), cf. figure 1. Among the first 100 entries, 33 clusters are found in the US, 3 in France, 2 in Sweden, and 1 each in Australia, Canada, China, Germany, Russia, and the UK.

Table 1
Percentage of cluster computers in the TOP500 list.

Figure 1. Computer class distribution in the TOP500 list.

Cluster computing started with Beowulf systems in 1994 [26]. But it was not before the advent of networks with gigabit point-to-point performance like Myrinet that clusters could become competitive. While CPU clock rates grow more or less continuously, doubling every 21 months according to Moore's "Law", performances of commodity networks tend to increase in a step-wise fashion. As a rule of thumb, many HPC applications ask for a network speed of 1 Gbit/s per 1 GHz clock speed of a node. Fig. 2 demonstrates that this matching point was reached around 1999 [27]. The breakthrough of clusters occurred when PC prices went down and low latency switches, high level message passing standards like MPI and reliable communication software became available.

Figure 2. Co-evolution of CPU clock rate and network speed.

Clusters bear several advantages: they are built from cost-effective components, their modularity allows flexible hardware upgrades, they benefit from (OpenSource) software standards and they can be optimized for many specific applications.



3. QCD-OPTIMIZED HPC-CLUSTERS 

QCD-cluster computing was pioneered by Gottlieb in 1998. He built the "Candycane" Beowulf from 32 Pentium II PCs running at 350 MHz. Other early systems followed soon, see table 2. Since 2002, quite a few QCD-clusters have been installed. Still, the number of systems and their individual sizes are small compared to the general purpose clusters of the TOP500 list:

• Bielefeld [23 (2003) 

- 16 dual XEON, 2.4 GHz 

- with switched GigE 

- 16 dual Athlon MP 1800 

- with Myrinet2000 

• Bern [H] (2003) 

- 32 dual XEON, 2.4 GHz 

- Intel E7500 chipset 

- DDR RAM 

- Myrinet2000, 2 x 190 MB/s bi-dir bw 
Table 2
"Early" QCD-clusters.

site                 name       year       # of procs  CPU type     clock [MHz]  net
Indiana State        Candycane  1998       32          PII          350          fast ethernet
Eötvös/Budapest      PMS        1998/1999  32/64       K6-2         450          ISA 2D
Wuppertal            ALiCE      1999       128         Alpha 21264  616          Myrinet
Jlab                 Calico     2000       16 + 18     Alpha 21264  667          Ethernet
Adelaide             ORION      2000       40 x 4      SUN E420R    -            Myrinet2000
FNAL                 QCD80      2000       80          PIII dual    700          Myrinet2000
Zhongshan/Guangzhou  -          2000       10          PIII dual    500          fast ethernet


• DESY [37] (2002)

- 16+16 dual XEON, 1.7/2.0 GHz (Hamburg) 

- 16 dual XEON, 1.7 GHz (Zeuthen) 

- Supermicro P4DC6

- 1 GB RDRAM per node 

- Myrinet2000 

• FNAL [38] (2002)

- 128 dual XEON, 2.4 GHz 

- 48 dual XEON, 2.4 GHz 

- Supermicro P4DPR-6GM+ 

- Intel E7500 chipset 

- 128 + 48 GB DDR RAM 

- Myrinet2000, 2 x 135 MB/s bi-dir bw 

- GigE mesh on 16 nodes 

• Jlab [39] (2002)

- 128 single XEON, 2.0 GHz 

- Intel E7500 chipset 

- 65 GB DDR RAM 

- Myrinet2000 

• Seoul (2003) 

- 30 P4, 2.4 GHz 

- 16 GB DDR RAM 

- Fast ethernet 

• Taipei [40] (2003)

- 30 P4, 1.6/2.0 GHz, Farm 

- RDRAM tuned for overlap simulations 

• Tsukuba (CCP) [41] (2003)

- 16 dual XEON, 2.8 GHz 

- 64 GB DDR RAM 

- Myrinet2000 

In early 2002, the Budapest group carried out first runs with the Poor Man's Supercomputer v.3 (PMSv.3). The Budapest Architecture was the first large cluster system to use a Gigabit Ethernet mesh as connectivity:

Figure 3. GigE mesh of the Budapest PMSv.3 cluster (from [42]).



• Budapest [42] (1/2002)

- PMSv.3 

- 128 P4, 1.7 GHz 

- Intel GBD MB 

- 512 MB/node RDRAM 

- 4 x SMC 9452 Gbit-NIC 

- PCI 32bit/33 MHz 

The Budapest Architecture can be reconfigured to smaller partitions by re-wiring of the GigE mesh, see figure 3. PMSv.3 achieves a price/performance ratio of less than 1 $/Mflops sustained for single precision Wilson fermion matrix inversions. This ratio was assessed in 1/2002 from pricing data quoted at www.pricewatch.com, adding 10 %. Fig. 4 shows node performances for an optimal number of nodes constrained by the available memory on PMSv.3. The MILC HMC code was optimized by SSE constructs for time-critical code blocks.¹

Figure 4. Performances of Wilson and staggered fermion matrix inversion on the PMSv.3 (from [42]).

PMSv.3 has demonstrated that sophisticated 
networks can be avoided on streamlined QCD- 
clusters, which otherwise eat up a substantial part 
of the available budget. 

The Budapest Architecture is a model for QCD-clusters (i) providing high single node performances, (ii) delivering sufficient network performance at low costs, and (iii) being scalable.

4. CLUSTER HARDWARE TRENDS 

The efficiency of parallel computers is deter- 
mined both through the local efficiency of the 
compute nodes and the performance of the com- 
munication network. In particular, the speed of 
the network interface, i.e. the speed of the PCI 
sub-system, is a key parameter to benchmark 
present commodity hardware. 

¹ The use of the multimedia extension (MMX) for AMD K6-2 was suggested in 1999 for PMSv.1 and was subsequently used in finite density QCD computations [43]. At the same time, M. Lüscher presented fast SSE coding on Intel platforms [24].



4.1. CPUs 

Let's concentrate on PC processors that are currently relevant for cluster computing: in Q2 2003, clock frequencies of Intel P4 and XEON CPUs have reached more than 3 GHz; AMD Athlon and Opteron CPUs have touched the 2 GHz threshold²; the Athlon64 CPU appeared in Q3 2003. While it's safe to say that the development of CPU clock speeds will follow Moore's "Law", it is of course difficult to predict the detailed evolution of CPUs and chipsets, even for the near future. In Fig. 5, I have tried to collect the information made public by Intel and AMD.
According to these numbers, Intel P4 and XEON 
processors will approach 3.4 GHz near the end of 
2003 while CPU speeds of more than 3.6 GHz can- 
not be envisaged before Q2 2004. The 1.5 GHz 
Itanium2 chip appears to be with us for quite a 
while. A successor to the Pentium 4, called "next 
generation" processor, might be expected in the 
second half of 2004. Further details on Intel and 
Opteron processors cannot be given here. 

4.2. Memory and front side bus 

QCD computations are largely determined by the memory-to-cache data rate available on the given chipset. A key figure is the frequency of the so-called front side bus (FSB). The FSB connects the processor to the north-bridge, the memory controller hub (MCH). The memory frequency itself must match the FSB frequency for maximal bandwidth. To give an example, the 800 MHz FSB requires 400 MHz dual channel DDR RAM (PC3200) to be fully saturated.

Let's clarify the nomenclature: the acronym DDR stands for "double data rate", exploiting both the rising and falling edges of the signal, unlike standard SDRAM. Such a memory type in principle delivers a data rate D = 8 × 2 × f B/s, where f is the memory clock frequency. As a next step, the dual channel memory controlling technology has been introduced, which allows one to double D once more by means of logical words of length 144 bits (2 × (64 + 8)) that are split over two memory banks. In other words, a dual channel twin module mimics an effective frequency of 2f MHz. With respect to the above example, 400 MHz is the effective frequency of a twin module with D = 6.4 GB/s.

Figure 5. Clock frequency road-map (status Q2 2003).

² The number tags of AMD Athlons mimic the performance-equivalent clock rate of Intel chips.
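
The data-rate formula above can be checked with a few lines of Python. This is a minimal sketch: the function name `ddr_data_rate` is hypothetical, and a memory clock of f = 200 MHz for PC3200 and a 64-bit (8 byte) FSB are assumed.

```python
def ddr_data_rate(f_hz: float, channels: int = 1, bus_bytes: int = 8) -> float:
    """Peak DDR data rate in B/s: bus width x 2 transfers per clock x clock x channels."""
    return bus_bytes * 2 * f_hz * channels


# Assumption: PC3200 modules run at a memory clock of f = 200 MHz.
single_channel = ddr_data_rate(200e6)
dual_channel = ddr_data_rate(200e6, channels=2)

# Assumption: the FSB is 8 bytes wide and runs at an effective 800 MHz.
fsb_peak = 8 * 800e6

print(f"single channel: {single_channel / 1e9:.1f} GB/s")
print(f"dual channel:   {dual_channel / 1e9:.1f} GB/s")
print(f"800 MHz FSB:    {fsb_peak / 1e9:.1f} GB/s")
```

Under these assumptions the dual channel rate equals the FSB peak, which is the matching condition described in the text.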

A further increase in DDR memory frequency is expected for Q1 2004 with the appearance of 667 MHz DDRII dual channel SDRAM. Recently, RAMBUS announced XDR DRAM running at a speed from 3.2 to 6.4 GHz with D = 6.4 and 12.8 GB/s per channel [44]. But keep in mind that RAMBUS memory tends to be nearly twice as expensive as equivalent DDR memory.

The STREAM benchmark is a reliable estimate for the actual data rate that can be achieved on a given system. Fig. 6 shows results of STREAM for a variety of platforms. In terms of the maximal bandwidth, STREAM gives about 87% on a 2.66 GHz P4 platform equipped with 200 MHz DDR RAM on a 533 MHz FSB, for example. The STREAM benchmark is clearly dominated by the NEC SX-5 vector system.

At this stage let me note that the so-called "machine balance"³ B, i.e. the ratio of Mflops vs. the memory accesses in Mwords/s, has increased for PC hardware in recent years. While Intel 486 boards had B close to 1 (a value maintained today on vector systems like the SX-5 only), Pentium P4 boards show B = 10 ... 20. Consequently, a 1.7 GHz P4 CPU saturates 2 channel RAMBUS 400 MHz RAM with a maximal data rate of 1.6 GB/s for SSE-boosted Wilson fermion codes. It currently appears to be less cost efficient to choose CPUs with the highest frequency.

B is even more unfavorable for XEON dual processor systems as the FSB capacity is shared among the processors. Given a maximal FSB frequency of 533 MHz (Q3 2003), B is nearly 3 times larger than for the fastest Pentium P4 boards with 800 MHz FSB, which were available already in Q2 2003. 800 MHz boards for the XEON processor will not become available before Q2 2004; still, the difference to P4 will be a factor of two.⁴

Figure 6. Comparison of STREAM "Triad" benchmark results on a variety of computer platforms (taken from [46]).

³ B should better be called machine imbalance.

AMD currently supports a FSB frequency of 400 MHz for the Athlon processor, while the memory connection to the AMD Opteron processor is enabled through an internal memory controller with direct memory access. The advantage is that B is constant for single, dual or quad Opteron systems. In other words, Opteron is scalable.

4.3. PCI 

PCI is a hardware standard to connect PCs with external devices. The speed of PCI is the bottleneck dominating the performance of the interconnectivity of cluster computers. We have witnessed several improvement steps since 1993, through which PCI evolved from a 32bit/33MHz bus to 64bit/133MHz PCI-X in 1999. However, one should be aware that 64bit PCI bus widths are not supported on standard PC boards. Clearly, the theoretical bi-directional Gigabit-Ethernet performance of 2 Gbit/s cannot be served adequately by a 32bit/33MHz PCI bus. Thus, already for Gigabit-Ethernet we encounter a 2:1 PCI-bus over-booking on standard PC boards. Myrinet2000 with a bi-directional bandwidth of 4 Gbit/s requires at least a 64bit/66MHz PCI bus.⁵
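
The over-booking argument can be made explicit with a short calculation. The following is a sketch only: the helper `pci_peak_gbit` is a hypothetical name, nominal clock values are used, and the result is the theoretical peak of the shared bus, not a measured throughput.

```python
def pci_peak_gbit(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak rate of a parallel PCI bus in Gbit/s (width x clock)."""
    return width_bits * clock_mhz * 1e6 / 1e9


pci_32_33 = pci_peak_gbit(32, 33)    # standard PC boards
pci_64_66 = pci_peak_gbit(64, 66)    # server boards
pcix_64_133 = pci_peak_gbit(64, 133) # PCI-X

gige_bidir = 2.0     # Gbit/s, bi-directional Gigabit-Ethernet (from the text)
myrinet_bidir = 4.0  # Gbit/s, bi-directional Myrinet2000 (from the text)

overbooking = gige_bidir / pci_32_33

print(f"PCI 32bit/33MHz:    {pci_32_33:.2f} Gbit/s")
print(f"PCI 64bit/66MHz:    {pci_64_66:.2f} Gbit/s")
print(f"PCI-X 64bit/133MHz: {pcix_64_133:.2f} Gbit/s")
print(f"GigE over-booking on 32bit/33MHz PCI: {overbooking:.1f}:1")
```

With these nominal numbers the 32bit/33MHz bus offers roughly half of what bi-directional GigE could deliver, while the 64bit/66MHz bus just covers the 4 Gbit/s of Myrinet2000.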

A new standard, PCI-Express, is about to enter the market early in 2004. PCI-Express is a fundamental re-design as compared to PCI-X. Instead of a parallel 64bit bus, PCI-Express is based on a serial bus with several channels (lanes). The performance per lane will be 2.5 Gbit/s; up to 32 lanes are possible. With PCI-Express, cluster networks will enter the O(100) Gbit/s era [48].
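
To see where the O(100) Gbit/s estimate comes from, one can multiply out the lane figures quoted above. This is a rough sketch; the 8b/10b line coding overhead of first-generation PCI-Express is added here as background knowledge and is not taken from the text.

```python
LANE_RATE_GBIT = 2.5  # raw signaling rate per lane and direction (from the text)
MAX_LANES = 32        # maximal link width (from the text)

raw_per_direction = LANE_RATE_GBIT * MAX_LANES
raw_bidirectional = 2 * raw_per_direction

# Assumption: 8b/10b coding, i.e. 8 payload bits per 10 transmitted bits.
payload_per_direction = raw_per_direction * 8 / 10

print(f"raw, one direction:     {raw_per_direction:.0f} Gbit/s")
print(f"raw, both directions:   {raw_bidirectional:.0f} Gbit/s")
print(f"payload, one direction: {payload_per_direction:.0f} Gbit/s")
```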

It is well known that early P4 and XEON board chipsets were delivering much less PCI bandwidth than promised by specifications. On today's Intel E750x and Serverworks GC chipsets such performance degradations have been overcome [49].

Figure 7. Myrinet aggregate bi-directional bandwidth on a XEON system for the PALLAS MPI-benchmark (taken from [38]).

⁴ The catch is the weak PCI bus of PC boards.
⁵ To my knowledge there is only one genuine P4 board with a FSB above 500 MHz supporting PCI-X, while many boards for dual XEON, Athlon and Opteron processors meanwhile are equipped with PCI-X. The Tyan 2726 XEON board even supports 4 on-board GigE slots on 2 PCI-X channels [47].

4.4. Network technology 

Cluster pioneer Myricom presented "Myrinet" with 2 Gbit/s bi-directional bandwidth already in 1997 and has evolved the product to Myrinet2000 with a bi-directional bandwidth of 4 Gbit/s. Fig. 7 shows the aggregate MPI-bandwidth as a function of the message length, using the genuine Myrinet communication driver GM 1.5.3 on the FNAL systems, cf. section 3. As just announced, the maximal bi-directional bandwidth can reach 950 MB/s (two channels). The latency for the PALLAS MPI-benchmark, i.e. half of the zero-message length round-trip time, is 5 μs [50]. Switches are available for up to 128 ports in a single cabinet. They can be combined to multi-stage crossbars with thousands of ports. The switch latency lies in the range of O(100) ns per stage.

Another major advance in cluster network performance has been achieved with the novel Infiniband standard. Infiniband is designed for a bandwidth of 10 Gbit/s. First performance measurements can be found in Ref. [51] (Fig. 8). The latency for zero-message length is supposed to be about 7 μs. The Infiniband road-map is shown in Fig. 9. With PCI-Express expected for early 2004, Infiniband networks will deliver up to 20 Gbit/s bi-directional bandwidth. Currently 96-port switches are available that can be combined to a larger multi-stage crossbar. The additional latency per switch stage is reported to be about 200 ns.

Figure 8. Comparison of aggregate bi-directional bandwidths (PALLAS MPI-benchmark) for Infiniband, Myrinet2000 and QsNET (Quadrics) (taken from [51]).

While the performances of Myrinet, QsNET and Infiniband are impressive, the costs are substantial. A Myrinet2000 interface card costs $1000 and a switch port about $400 on average. Mellanox Infiniband lies in the same price range (Q3 2003). With about $1400 per node, networking costs surpass the costs for the compute nodes. At this stage, these sophisticated networks appear to be reserved for high-end general purpose cluster systems.

Figure 10. ParaStation TCP/IP bandwidth improvement vs. standard TCP/IP under Linux on a 2.6 GHz, 400 MHz FSB XEON system.

In order to provide cheaper and faster communication, the FNAL group has constructed its own Gigabit-Ethernet network cards based on FPGAs. As these cards support up to 8 ports, one can arrange the processing elements in the form of a hypercube. Currently, the card is designed for 32bit/33MHz PCI; a PCI-X version is planned. Still, the costs exceed $500 per card.

On the other hand, standard GigE PCI cards 
cost about $40, while dual and quad cards 
amount to $150 and $400, respectively. I already 
mentioned a system supporting up to four GigE 
ports on board, hence PCI cards aren't required 
at all in case of a Budapest Architecture. 

How large a bandwidth can we squeeze out of a point-to-point GigE connection? The answer is largely dependent on the TCP/IP driver used. Standard drivers allow for somewhat more than 100 MB/s bi-directional bandwidth. Communication optimized drivers like ParaStation [53] can provide a much larger bandwidth. On a XEON system (64bit/66MHz PCI), ParaStation TCP/IP reaches up to 200 MB/s bi-directional bandwidth, see Fig. 10. Of course, one would wish to achieve an aggregate bandwidth of 800 MB/s on a GigE mesh. This requires, first of all, systems with PCI-X, and in order to achieve maximal bandwidth, the communication software has to drive four network cards simultaneously.
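
The need for PCI-X can be illustrated by comparing the wished-for mesh bandwidth with nominal bus peaks. This is a sketch under simple assumptions: four links per node at the 200 MB/s quoted above, and theoretical bus peaks computed as width times clock. The helper name `pci_peak_mb_s` is hypothetical.

```python
GIGE_BIDIR_MB_S = 200.0  # per link, optimized driver (from the text)
LINKS_PER_NODE = 4       # four network cards per node (from the text)


def pci_peak_mb_s(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak rate of a parallel PCI bus in MB/s."""
    return width_bits / 8 * clock_mhz


aggregate = LINKS_PER_NODE * GIGE_BIDIR_MB_S

buses = {
    "PCI 32bit/33MHz": pci_peak_mb_s(32, 33),
    "PCI 64bit/66MHz": pci_peak_mb_s(64, 66),
    "PCI-X 64bit/133MHz": pci_peak_mb_s(64, 133),
}

for name, peak in buses.items():
    verdict = "sufficient" if peak >= aggregate else "too slow"
    print(f"{name}: {peak:.0f} MB/s peak vs. {aggregate:.0f} MB/s wanted -> {verdict}")
```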

While meshes or grids are scalable with respect to nearest-neighbor computations, more complicated QCD applications require a switched network. Quite recently, level 3 enabled routed GigE switches appeared with O(500) ports like the Myrinet GigE switch [54], the CISCO Catalyst 6500 [55] or the Force10 E series [56]. The costs per port came down within the last half year (about $300 per port in Q3 2003). Certainly, the bi-directional bandwidth will not exceed 200 MB/s for switched GigE connections. In fact, most switches are overbooked. The additional latency of the Myrinet switch is 3.5 μs, the CISCO Catalyst 6500 adds between 12 and 16 μs, while the Force10 switch is reported to give 23 μs.

Table 3 presents throughputs and latencies of the various PCI-based cluster connectivities.

A comment: clusters with hybrid networks, i.e. 
merging a mesh with a switched system, appear 
to be quite an effective solution for non-nearest- 
neighbor QCD computations. In that case, 
nearest-neighbor communication can be routed 
over the mesh, non-nearest-neighbor communica- 
tion tasks are routed through the switch. 

4.5. Middleware 

One should not forget stability and administration of clusters. These issues, which become crucial on large systems, are the domain of cluster middleware like SCore [58] or ParaStation [53]. Besides error correction, packet-loss safety, though expensive to realize, is required to achieve long term stability. Furthermore, large systems need automated administration tools which can take care of safe job termination and system supervision. To give an example, the ParaStation middleware is based on a virtual machine/partition concept that can prevent local instabilities from spreading. In this manner, we enjoy stable uptime periods of several months on the Wuppertal ALiCE cluster.

Table 3
Network characteristics.

Net              bw bi-dir    latency  per stage  ref.
Infiniband       20 Gbit/s    7 μs     200 ns
QsNET            5.44 Gbit/s  2 μs
Myrinet          4 Gbit/s     5 μs     200 ns
  (2003)         8 Gbit/s
GigE             2 Gbit/s     27 μs    12-23 μs   [57]
  (ParaStation)               12 μs
  (JLAB)                      12 μs

5. PERFORMANCE AND SCALING OF 
QCD CODES 

5.1. Single node performance 

The acceleration of QCD codes on single CPUs is of primary concern in order to achieve a high parallel performance. We can benefit from Moore's "Law": Fig. 11 demonstrates the performance improvements gained through increases of processor frequencies for the matrix-vector multiplication on a 16^3 × 16 lattice with the Wilson-Dirac operator using 1 processor per node (code by M. Hasenbusch) [25,59,60]. Performance critical parts of the Wilson-Dirac kernel are accelerated by SSE and SSE2 (streaming SIMD extension) constructs as described in Ref. [24].

Fig. 12 shows the dependency of the MILC staggered fermion code (32 bit) performance on the CPU clock frequency. Successive performance improvements are illustrated in Fig. 13 [62].

Note that XEON (1 processor of 2, 533 MHz FSB) and P4 (1 processor of 1, 800 MHz FSB) performances differ by a factor slightly less than the FSB frequency ratio. As to the dual node efficiency, one encounters typical gain factors (i.e. the gain seen when switching on the second processor and running programs in parallel) of 1.2 to 1.4 on XEON systems with DDR RAM and 1.6 for an early dual P4 RAMBUS platform. The small factor in case of XEON has been anticipated above (section 4.2), as dual XEON processors share the FSB. The records in local performances as of Q3 2003 are collected in Fig. 14.

Figure 11. Performance of the Wilson-Dirac multiplication (32 bit and 64 bit) as function of the CPU clock at fixed lattice size 16^3 × 16 (adapted from [61]).

Figure 13. Demonstration of successive performance optimization on the Pentium 4 with 800 MHz FSB for staggered fermions, as a function of the lattice size in MB [52].

5.2. Parallel efficiency

The performance per processor will decrease for parallel operation. On the PMSv.3 with GigE connectivity, the degradation is about a factor of 2 for both staggered and Wilson fermions, cf. Fig. 4. The parallel efficiency, determined keeping the local lattice size constant for single and parallel mode, is listed in table 4.⁶ On Myrinet clusters, typically more than 65% efficiency is achieved for both SSE and non-SSE coding. On the Wuppertal XEON cluster PAN (2.6 GHz, Myrinet), non-blocking communication is enabled under MPI by virtue of ParaStation. Hence, the communication can be hidden behind computation, leading to an efficiency of 0.91.

⁶ The i860 chipset shows a smaller parallel efficiency due to the defective PCI implementation mentioned earlier (section 4.3).


Table 4
Parallel performances and scaled efficiency [55].

system                                             single proc.  parallel       efficiency
                                                   [Mflops]      [Mflops/proc]
Myrinet, i860, SSE                                 579           307            0.53
Myrinet GM, E7500, SSE                             631           432            0.68
Myrinet Parastation, E7500, SSE                    675           446            0.66
Myrinet Parastation, E7500, non-SSE, non-blocking  406           368            0.91
Gigabit Ethernet, non-SSE                          390           228            0.58
Infiniband, non-SSE                                370           297            0.80
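
The efficiency column of Table 4 is simply the ratio of the per-processor performance in parallel mode to the single-processor performance at fixed local lattice size. The short sketch below recomputes it from the two performance columns as read from the table; the function name `scaled_efficiency` is hypothetical.

```python
# (system, single-processor Mflops, parallel Mflops per processor), read from Table 4
table4 = [
    ("Myrinet, i860, SSE", 579, 307),
    ("Myrinet GM, E7500, SSE", 631, 432),
    ("Myrinet Parastation, E7500, SSE", 675, 446),
    ("Myrinet Parastation, E7500, non-SSE, non-blocking", 406, 368),
    ("Gigabit Ethernet, non-SSE", 390, 228),
    ("Infiniband, non-SSE", 370, 297),
]


def scaled_efficiency(single_mflops: float, parallel_mflops: float) -> float:
    """Parallel efficiency at constant local lattice size."""
    return parallel_mflops / single_mflops


efficiencies = {
    name: round(scaled_efficiency(single, parallel), 2)
    for name, single, parallel in table4
}

for name, eff in efficiencies.items():
    print(f"{name}: {eff:.2f}")
```

Note that the non-blocking run reaches the highest efficiency although its single-processor speed is lower, since without SSE there is more computation time behind which communication can be hidden.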



Figure 14. Single processor performance records in Gflops (I thank S. Gottlieb, M. Hasenbusch, D. Holmgren, M. Lüscher, and P. Wegner for their contributions).



Fig. 15 shows parallel single/dual speeds on the DESY XEON system, using M. Lüscher's latest version of the e/o preconditioned Wilson-Dirac matrix-vector multiplication. The parallelization is 1-dimensional. With four processors on 2 nodes, a double precision performance of more than 1 Gflops per node could be achieved.

Fig. 16 gives an impression of the efficiency of the MILC staggered fermion code with fixed local lattice sizes on the 128 node dual XEON system at FNAL.
Figure 15. Parallel performances of double precision e/o Wilson-Dirac matrix-vector multiplications with SSE2 on 4 processors of the DESY cluster (M. Lüscher, H. Wittig) [61].



5.3. Scaling to massive parallelism 

One would like to exert as much CPU power as possible on a given lattice, as needed, e.g., for realistic turnaround times of dynamical overlap fermion simulations. While QCDOC and apeNEXT, as shown below, are designed with respect to fine granularity, clusters favor coarse grained parallelism.

Nevertheless, Fig. 17 demonstrates that Wilson fermions with ll-SSOR preconditioning and non-blocking MPI can scale quite far on clusters. We have benchmarked a test lattice of size 12^4. On 64 ALiCE processors we still achieve a speedup of about 32 using a 3-d processor geometry. By extrapolation we would expect the code to run on 512 processors with a speedup of 256 for a 16^3 × 32 lattice.
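
The extrapolation can be retraced numerically. The sketch below assumes that the parallel efficiency measured on 64 processors carries over to 512 processors when the local lattice volume per processor stays comparable; this assumption is an illustration, not a statement from the benchmark itself.

```python
# Measured point (from the text): 12^4 lattice on 64 processors, speedup about 32
measured_sites = 12**4
measured_procs = 64
measured_speedup = 32

efficiency = measured_speedup / measured_procs
local_sites_measured = measured_sites // measured_procs

# Target (from the text): 16^3 x 32 lattice on 512 processors
target_sites = 16**3 * 32
target_procs = 512
local_sites_target = target_sites // target_procs

# Assumption: same efficiency at comparable local volume
expected_speedup = efficiency * target_procs

print(f"efficiency on {measured_procs} processors: {efficiency:.2f}")
print(f"local volume: {local_sites_measured} sites (measured), "
      f"{local_sites_target} sites (target)")
print(f"expected speedup on {target_procs} processors: {expected_speedup:.0f}")
```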
Figure 16. Efficiency of the MILC staggered fermion code on the FNAL dual XEON 128-node cluster [52].

Figure 17. Scaling of the Wuppertal HMC code with ll-SSOR preconditioned Wilson fermions for a 12^4 lattice on ALiCE [64,65].



6. PROSPECTS

Let me try to assess the current prospects of clusters as compared to "home made" QCD computers. This task is difficult enough as, on one hand, the development process of home made systems is often delayed (about two years for APEmille and presumably nearly two years again both for apeNEXT [67] and QCDOC); on the other hand, there are continuous changes in the PC processor market. As evaluation criteria, the cost functions price/performance ratio R for investment and waste heat H for operation are used.

6.1. QCDOC

The development of QCDOC ("QCD on a chip") is documented in three proceedings of previous lattice conferences [68,69,15]. In fact there is very good news: first successful floating point operations were carried out on a prototype ASIC in Q2 2003.

The QCDOC CPU is based on a 500 MHz, 32 bit PPC 440 core, with a 64-bit floating point unit of 1 Gflops peak, and 4 MB on-chip memory. The nearest-neighbor topology is a 6-d hypercube with an aggregate bandwidth of 12 Gbit/s (for 12 directions) per processor. With 550 ns, the latency will be extremely small. A simulation of the processor gave a sustained performance of 50% of peak or 465 Mflops for the Wilson-Dirac operator on a 2^4 lattice (T. Wettig in [55]).

Due to low latency and high local efficiency, QCDOC will be perfectly scalable and hence deliver full compute power on small lattices. High performances require the use of assembler coding. Peter Boyle's assembler generator will be an important asset of the machine. Large QCDOC systems are likely to be partitioned into smaller parts. Main physics targets are dynamical (chiral) fermion simulations with small quark masses.

At the time of this conference first daughterboards have been tested, a 128-node system is planned for autumn 2003, and a 5k-node system should be finished by the end of 2003. In late spring 2004, 5 Tflops sustained are planned for both UKQCD/Edinburgh and Riken and 2.5 Tflops sustained for Columbia University. Funding is aimed at a 10 Tflops sustained QCDOC system for the US-community (SCIDAC) [70].


6.2. apeNEXT 

Detailed information on the development of apeNEXT can be found in the proceedings of earlier lattice conferences [71,72,21].

The apeNEXT processor design was finished at the end of June 2003 [73]. The 200 MHz 64 bit CPU hosts a 64-bit floating point unit capable of 1.6 Gflops peak. The memory bandwidth is 3.2 GB/s. The topology is 3-dimensional with an aggregate nearest-neighbor performance of 1.2 GB/s. The latency will be O(100) ns and thus favor high scalability. The processor simulator achieves a sustained performance of 944 Mflops for the Wilson-Dirac operator (D. Pleiter in [25]). Therefore, 512 processors in a rack can deliver about 0.5 Tflops sustained. In addition to TAO, a C compiler will be available. Note that the partitions are quantized in units of 4n × 8 × 8 processors, n ∈ N.

A first processor prototype is expected for late 2003. In early 2004, 256 nodes should be assembled. INFN plans for the funding of several sustained Tflops, while DESY and GSI (Germany) intend to install 15 and 10 Tflops peak, respectively.

6.3. A low-cost QCD-cluster 

Let's build our own cost-optimized QCD- 
cluster! We adopt the LatFor setting j2S] where 
dynamical Wilson fermions are simulated by 
HMC on a 32 3 x 64 lattice at a quark mass 
characterized through 0.3 < m^jvn p < 0.6. To 
achieve reasonable turnaround times, we aim for 
0.25 Tflops sustained. 

Recall that Lüscher and Wittig obtained between
380 and 490 Mflops/proc sustained performance
on a dual XEON 2.0 GHz node under Myrinet2000
for local lattice sizes between 1k and 16k sites
(section 5.2). On a 512-processor system, the
local lattice for the LatFor test case is 4k
sites. Hence it is reasonable to take roughly the
average of both numbers, i.e. 430 Mflops/proc,
adding up to a total sustained performance of
0.22 Tflops. As connectivity we can choose a
2-dimensional GigE mesh of 32/2 x 64/2 = 512
processors, since Fodor et al. have demonstrated
on the DESY machine that GigE meshes come close
to Myrinet performance [42].
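
The sizing argument above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text; the variable and function names are hypothetical.

```python
from math import prod

LATTICE = (32, 32, 32, 64)   # LatFor test case, sites per direction
MESH = (32 // 2, 64 // 2)    # 2-d processor mesh from the text: 16 x 32
MFLOPS_PER_PROC = 430.0      # sustained value adopted in the text


def local_sites(lattice, n_procs):
    """Lattice sites held by each processor for an even decomposition."""
    return prod(lattice) // n_procs


n_procs = prod(MESH)
sites = local_sites(LATTICE, n_procs)
total_tflops = n_procs * MFLOPS_PER_PROC / 1.0e6

if __name__ == "__main__":
    print(f"{n_procs} processors, {sites} sites each, "
          f"{total_tflops:.2f} Tflops sustained")
```

The local volume of 4096 sites lies inside the 1k to 16k range for which the benchmark numbers were quoted, which is what justifies interpolating between them.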

Let us specify the following Gedanken-cluster 
(prices by www.pricewatch.com, Q2 2003): 

Mobo   GA-8EGXR-PEC, 533FSB, DDR-266, 6 PCI    $210
CPU    2 XEON 2.0GHz, 512K CACHE               $258
Mem    1 GB dual DDR 266 MHz                   $119
Case   inch Power 500 W                        $55
Disk   EIDE 80 GB                              $66
GigE   4 x PCI cards 4 x $29                   $126
Sum    per dual node                           $834

The waste heat, H, amounts to about 30 kW. 
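
As a rough cross-check of the price/performance ratio, the component prices listed above can be summed and divided by the sustained performance. The sketch assumes 256 dual nodes (512 processors) at 430 Mflops/proc and counts hardware components only; cabling, racks and assembly are not included, and all names are hypothetical.

```python
# Component prices per dual node in USD, as listed in the text.
NODE_PARTS_USD = {
    "Mobo": 210,
    "CPU": 258,
    "Mem": 119,
    "Case": 55,
    "Disk": 66,
    "GigE": 126,
}
N_NODES = 256                      # dual nodes, i.e. 512 processors
SUSTAINED_MFLOPS = 512 * 430.0     # total sustained performance
WASTE_HEAT_W = 30_000.0            # "about 30 kW" from the text

node_price = sum(NODE_PARTS_USD.values())        # 834
cluster_price = N_NODES * node_price             # 213504
price_per_mflops = cluster_price / SUSTAINED_MFLOPS
heat_per_mflops = WASTE_HEAT_W / SUSTAINED_MFLOPS

if __name__ == "__main__":
    print(f"node: ${node_price}, cluster: ${cluster_price}")
    print(f"R = {price_per_mflops:.2f} $/Mflops")
    print(f"H = {heat_per_mflops:.2f} W/Mflops")
```

With these inputs R comes out slightly below 1 $/Mflops. The rounded 30 kW gives a waste heat per Mflops of the same order as the value listed for the cluster in Table 5.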

6.4. Comparison 

Table 5 compares the cost functions and maximal
processor numbers of QCDOC, apeNEXT and the
mesh cluster with respect to the LatFor test case.



R: First we extrapolate R to a common point in
time, say 01/2005. R is likely to drop to
0.5 $/Mflops for the cluster system by then
(Moore's "Law"). Hence, investment costs will
favor a cluster in 01/2005.
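
The extrapolation can be written as an exponential decay of the price/performance ratio. The sketch below assumes a halving period of 18 months and about 18 months between the Q2 2003 prices and 01/2005; both numbers are assumptions made for illustration, since the text only refers to Moore's "Law". The function name is hypothetical.

```python
def extrapolate_ratio(r0, months_elapsed, halving_months=18.0):
    """Price/performance ratio after months_elapsed, halving every
    halving_months (assumed value, not taken from the text)."""
    if halving_months <= 0:
        raise ValueError("halving_months must be positive")
    return r0 * 0.5 ** (months_elapsed / halving_months)


if __name__ == "__main__":
    r_2003 = 0.97  # $/Mflops, cluster estimate for Q2 2003
    r_2005 = extrapolate_ratio(r_2003, months_elapsed=18.0)
    print(f"R(01/2005) = {r_2005:.2f} $/Mflops")
```

A somewhat longer or shorter halving period changes the result only moderately, so the estimate of roughly 0.5 $/Mflops is not very sensitive to this choice.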

H: The cost of operation of the cluster, as
determined through H, will lie below $20,000 per
year, assuming German electricity costs for major
customers. Operating QCDOC and apeNEXT will be
considerably cheaper, by factors of 10 and 5,
respectively. Thus, costs of operation favor
QCDOC or apeNEXT.
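
The operating cost bound can be related to an electricity price as follows. The sketch assumes continuous operation (8760 hours per year) at the full 30 kW and ignores cooling overhead; both are assumptions, and the function names are hypothetical. No electricity tariff is given in the text, so the code only derives the price implied by the $20,000 bound.

```python
HOURS_PER_YEAR = 365 * 24  # continuous operation (assumption)


def annual_energy_kwh(power_kw):
    """Energy drawn in one year at constant power."""
    return power_kw * HOURS_PER_YEAR


def implied_price_per_kwh(annual_budget_usd, power_kw):
    """Electricity price at which the yearly bill equals the budget."""
    return annual_budget_usd / annual_energy_kwh(power_kw)


if __name__ == "__main__":
    energy = annual_energy_kwh(30.0)
    price = implied_price_per_kwh(20_000.0, 30.0)
    print(f"{energy:.0f} kWh/year, bound corresponds to {price:.3f} $/kWh")
    # Factors of 10 and 5 quoted in the text for QCDOC and apeNEXT.
    print(f"QCDOC < ${20_000 / 10:.0f}, apeNEXT < ${20_000 / 5:.0f} per year")
```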



C: With respect to the LatFor test case, we have
chosen the maximal number of processors that can
be realized in a 2-d mesh geometry, C = 512. In
contrast, apeNEXT is limited to C = 2048
processors, while QCDOC can deploy tens of
thousands of processors.

In order to improve on this situation for
clusters, one can resort to a 3-dimensional
geometry, which in principle allows for C = 8k.
Of course, the scalability S might limit the
performance for yet smaller numbers of
processors. It is possible that cache-resident
coding will help here [57].

Table 5
Cost functions for QCDOC, apeNEXT and Cluster
with respect to the LatFor test case. P is the
total performance in Tflops, R is the
price/performance ratio in $/Mflops, H the waste
heat in W/Mflops, and C the maximal number of
processors. Performances are sustained.

system    year     proc   P       R     H      C
QCDOC     2004     512    0.238   ≈1    0.01   > 16k
apeNEXT   2004/5   256    0.241   ≈1    0.02   2048
Cluster   2003     512    0.220   ≈1    0.12   512

As far as dynamical Overlap fermions are
concerned, first simulations are likely to use
lattices of size $16^3 \times 32$. Aiming at
maximal throughput, one should be aware that the
numbers of processors are limited to C = 128 for
a 2-d cluster, C = 1k for a 3-d cluster, C = 1k
for apeNEXT, or C = 8k for QCDOC. In other
words, the 8k QCDOC can simulate this specific
Overlap fermion problem 4 times as fast as the
2-crate apeNEXT, 8 times as fast as the 3-d
cluster with 1k processors (assuming scalability)
or 64 times as fast as the 2-d cluster with 128
processors.
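
The processor limits quoted for mesh geometries are consistent with a simple counting rule: parallelize the longest lattice directions and split each of them down to a local extent of 2. The sketch below implements this rule, which is inferred from the arithmetic 32/2 x 64/2 = 512 given earlier and is therefore an assumption; the function name is hypothetical. The apeNEXT limit is not covered, since it is additionally constrained by the partition shape.

```python
from math import prod


def max_processors(lattice, dims, min_local_extent=2):
    """Maximal processor count when the `dims` longest directions of
    `lattice` are each split down to `min_local_extent` sites."""
    if not 1 <= dims <= len(lattice):
        raise ValueError("dims must lie between 1 and the lattice dimension")
    longest = sorted(lattice, reverse=True)[:dims]
    return prod(extent // min_local_extent for extent in longest)


if __name__ == "__main__":
    for name, lattice in (("LatFor", (32, 32, 32, 64)),
                          ("Overlap", (16, 16, 16, 32))):
        limits = {d: max_processors(lattice, d) for d in (2, 3, 4)}
        print(name, limits)
```

For the 16^3 x 32 lattice this gives 128, 1024 and 8192 processors for 2-d, 3-d and 4-d parallelization, matching the values of C quoted for the 2-d cluster, the 3-d cluster and QCDOC.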

7. SUMMARY AND OUTLOOK 

The price/performance ratio of QCD-clusters
has just crossed the R = 1 $/Mflops threshold;
QCDOC and apeNEXT are supposed to deliver
this ratio mid/end of 2004. The waste heat per
Mflops, H, is about 10 times larger for clusters.
Hence, the TCO for 5 years of operation turns out
to be similar for QCDOC, apeNEXT and clusters.

As far as simulations on small lattices are
concerned, the attainable throughput depends on
the compute power that can be applied, which is
determined by the dimensionality of the
parallelization. This is an advantage of 3-d and
4-d network geometries.

Clusters can be used for complicated actions
if a switched network complements mesh or grid.
They will further improve with respect to
home-made systems due to PCI-Express (see
footnote 7), networks and improved communication
software.

Jefferson Lab has recently installed a GigE-mesh
QCD-cluster with 256 dual XEON nodes arranged
as a 4 x 8 x 8 grid, expected to deliver
1 Gflops per node sustained for the Wilson-Dirac
operator. Wuppertal University is about to
install a 1024-processor system combining a GigE
mesh architecture with a switched network.

Finally, we should gauge all our efforts with
respect to commercial supercomputers scheduled
for 2 $/Mflops sustained end of 2005.

7 PCI-Express based co-processors like the 25 Gflops
ClearSpeed™ CPU-array just announced [77] might be
promising PC accelerators.